Finding and removing duplicate records
Problem
You want to find and/or remove duplicate entries from a vector or data frame.
Solution
With vectors:
    # Generate a vector
    set.seed(158)
    x <- round(rnorm(20, 10, 5))
    # 14 11 8 4 12 5 10 10 3 3 11 6 0 16 8 10 8 5 6 6

    # For each element: is this one a duplicate (first instance of a particular value not counted)
    duplicated(x)
    #  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE
    # [13] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

    # The values of the duplicated entries
    # Note that '6' appears in the original vector three times, and so it has two entries here
    x[duplicated(x)]
    # [1] 10 3 11 8 10 8 5 6 6

    # Duplicated entries, without repeats
    unique(x[duplicated(x)])
    # 10 3 11 8 5 6

    # The original vector with all duplicates removed. These do the same:
    unique(x)
    x[!duplicated(x)]
    # 14 11 8 4 12 5 10 3 6 0 16
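Because duplicated() never flags the first instance of a value, it cannot by itself pick out every copy of a repeated value. One possible way around this, sketched below on a small hand-made vector (the names v, all_copies, singletons and counts are illustrative and not part of the recipe above), is to combine a forward scan with a backward scan using the fromLast argument:

```r
# Small hand-made vector (illustrative; not the vector generated above)
v <- c(3, 7, 3, 9, 7, 3)

# duplicated() skips the first occurrence of each value;
# with fromLast = TRUE it skips the last occurrence instead
dup_forward  <- duplicated(v)
dup_backward <- duplicated(v, fromLast = TRUE)

# TRUE for every element whose value occurs more than once
all_copies <- dup_forward | dup_backward
# TRUE TRUE TRUE FALSE TRUE TRUE

# Values that occur exactly once
singletons <- v[!all_copies]
# 9

# How many times each distinct value occurs
counts <- table(v)
```

If only the counts are needed, table() alone may be enough; the two-scan approach is useful when the positions of all copies matter.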
With data frames:
    # A sample data frame:
    df <- read.table(header=TRUE, text='
     label value
         A     4
         B     3
         C     6
         B     3
         B     1
         A     2
         A     4
         A     4
    ')

    # Is each row a repeat?
    duplicated(df)
    # FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE

    # Show the repeat entries
    df[duplicated(df),]
    #   label value
    # 4     B     3
    # 7     A     4
    # 8     A     4

    # Show unique repeat entries
    unique(df[duplicated(df),])
    #   label value
    # 4     B     3
    # 7     A     4

    # Original data with repeats removed. These do the same:
    unique(df)
    df[!duplicated(df),]
    #   label value
    # 1     A     4
    # 2     B     3
    # 3     C     6
    # 5     B     1
    # 6     A     2
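The data frame examples above treat a row as a repeat only when every column matches. As a rough sketch of two common variations, the block below checks for repeats on a single column and collects every copy of a fully repeated row. The data frame d and the names first_per_label and all_copies are made up for illustration.

```r
# Hypothetical data frame, built here so the block runs on its own
d <- data.frame(label = c("A", "B", "C", "B", "B", "A"),
                value = c(4, 3, 6, 3, 1, 2))

# Keep the first row seen for each label, ignoring the value column
first_per_label <- d[!duplicated(d$label), ]
# rows 1, 2 and 3

# Every row belonging to a fully repeated (label, value) pair,
# including the first copy, found by scanning in both directions
in_repeated_group <- duplicated(d) | duplicated(d, fromLast = TRUE)
all_copies <- d[in_repeated_group, ]
# rows 2 and 4 (both B 3)
```

Passing a subset of columns, such as d[, c("label", "value")], to duplicated() should work the same way when only some columns define what counts as a repeat.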